## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
It seems that only a small proportion of the loans went past due. However, a considerable proportion of the loans have still defaulted or been charged off. This keeps me wondering: is the borrower’s status as a homeowner a factor in determining the loan status? In other words, does the borrower being a homeowner improve the chances of the loan being paid off?
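One way to approach the homeowner question is a simple cross-tabulation. The sketch below uses a tiny made-up data frame in place of the real data; only the column names `LoanStatus` and `IsBorrowerHomeowner` come from the dataset, and every value is invented for illustration.

```r
# Toy stand-in for the Prosper data; the values are made up for illustration.
loans <- data.frame(
  LoanStatus = c("Completed", "Current", "Defaulted", "Completed",
                 "Chargedoff", "Current", "Completed", "Defaulted"),
  IsBorrowerHomeowner = c("True", "True", "False", "False",
                          "False", "True", "True", "True")
)

# Counts of each loan status, split by homeowner status.
status_by_owner <- table(loans$IsBorrowerHomeowner, loans$LoanStatus)

# Row-wise proportions: within each homeowner group, the share of each status.
status_share <- prop.table(status_by_owner, margin = 1)
print(round(status_share, 2))
```

Comparing the rows of `status_share` shows whether one group has a visibly larger share of completed or defaulted loans. A difference in shares alone would not establish that homeownership causes it.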
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 40.83 36.00 60.00
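According to the Prosper API, the possible term values are 36 and 60 months, for 3 and 5 years respectively, although the summary above also shows a minimum of 12 months. Since the first quartile, the median, and the third quartile are all 36, the majority of the loans are for 3 years.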
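The share of loans at each term length can be read off a frequency table rather than inferred from quartiles. A minimal sketch, assuming a numeric vector of terms in months (the values below are hypothetical, not taken from the dataset):

```r
# Hypothetical sample of loan terms in months, for illustration only.
term <- c(36, 36, 60, 36, 12, 36, 60, 36, 36, 36)

# Same kind of five-number summary as shown above.
summary(term)

# Count and share of loans at each distinct term length.
term_counts <- table(term)
term_share <- prop.table(term_counts)
print(term_share)
```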
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 17.06 5.00 3672.00
Note that, as shown on the graph, the single most common BorrowerAPR value lies around 0.36 to 0.37, with a raw count of 3672 (about 3% of all 113937 loans). Amazing! After skimming through the Prosper webpage and API, I wonder whether BorrowerAPR has anything to do with the borrower’s state of residence and employment status, because those are among the factors on which the borrower’s interest rate is based.
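The count summary above can be reproduced by tabulating the distinct APR values and summarising the counts. This is a sketch under the assumption that the summary was computed this way; the `apr` vector is made up, and only the idea of counting distinct `BorrowerAPR` values comes from the text.

```r
# Hypothetical APR values; the real column is BorrowerAPR.
apr <- c(0.358, 0.358, 0.120, 0.358, 0.246, 0.120, 0.283)

# Frequency of each distinct APR value, most common first.
# table() ignores NA values by default.
apr_counts <- sort(table(apr), decreasing = TRUE)

# Summarising the counts gives a Min./Median/Max. table like the one above;
# the maximum is the count of the single most common value.
summary(as.vector(apr_counts))

# The most common value and the share of all loans that carry it.
most_common_apr <- as.numeric(names(apr_counts)[1])
share_of_loans <- as.numeric(apr_counts[1]) / length(apr)
```

Dividing the top count by the number of loans makes it clear whether the most common value is a true majority or only the tallest spike.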
The graph above shows the 10 states with the largest numbers of borrowers, along with their respective counts. California has the largest number of borrowers. This finding, surprising as it might seem at first sight, makes sense on reflection: California is the most populous US state and has seen a surge in commercial activity in recent years, especially among small IT startups, so it is plausible that those small firms borrow money.
Similarly, this plot shows the 10 occupations with the largest numbers of borrowers. Unsurprisingly, the category tagged ‘Other’ takes the trophy for this one. The category with the second most borrowers is ‘Professional’, which probably covers roles such as company executives or sales reps.
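Both top-10 rankings come down to sorting a frequency table. A minimal sketch, where `top_levels` is a hypothetical helper name and the state codes are invented for illustration:

```r
# Return the n most frequent values of a vector with their counts,
# largest first. top_levels is a hypothetical helper, not a package function.
top_levels <- function(x, n = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  head(counts, n)
}

# Made-up state codes for illustration.
state <- c("CA", "NY", "CA", "TX", "CA", "NY", "FL", "TX", "CA")

top_states <- top_levels(state, n = 3)
print(top_states)
```

In the real data, `BorrowerState` and `Occupation` both include an empty-string level, so blank entries would be counted as a category of their own unless they are filtered out first.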
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.074 0.092 0.096 0.117 0.284 29084
The plot above shows ten of the largest estimated returns within the dataset and their respective counts. From the graph, the return with the largest number of occurrences lies between 0.12 and 0.13, very close to 0.125. From the summary statistics, this value is above the 3rd quartile of the dataset (0.117), which is quite a surprising finding on its own.
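The summary above also reports 29084 missing values, so it helps to quantify the missing share and to compute the quartile explicitly before comparing it with the most common value. A small sketch on a made-up vector (the real column is `EstimatedReturn`):

```r
# Hypothetical estimated returns with some missing entries.
est_return <- c(NA, 0.0547, NA, 0.06, 0.0907, 0.125, 0.125)

# Number and share of missing values.
n_missing <- sum(is.na(est_return))
share_missing <- mean(is.na(est_return))

# Third quartile of the non-missing values.
q3 <- quantile(est_return, 0.75, na.rm = TRUE)

# Most common non-missing value; table() ignores NA by default.
counts <- sort(table(est_return), decreasing = TRUE)
most_common <- as.numeric(names(counts)[1])
```

Comparing `most_common` with `q3` checks the claim about the quartile directly instead of reading both numbers off a plot.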
As the graph shows, the vast majority of borrowers are employed. This is somewhat counterintuitive, because the popular image of a borrower is that of a cashless person asking for money.
As we can see, the distribution peaks somewhere around 5 and most of it lies below 10, which is what everyday common sense would suggest.
It seems that there are considerably fewer borrowers with the highest rating (AA) and the lowest rating (HR, high risk). The vast majority of borrowers have ratings in the middle of the spectrum (namely B, C, and D).
The plot above summarises the total number of inquiries per borrower by count. It shows that the vast majority of borrowers have had fewer than 20 inquiries. Considering that Prosper was launched in 2005 and has not grown to be very popular, that is a pretty reasonable number. Since the original graph turned out to be right-skewed, I log-transformed it, and the result is easier to read.
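The effect of a log transform on a right-skewed count can be sketched as follows. The exact transform used for the plot is not stated, so `log10(x + 1)` is an assumption here, chosen because it keeps zero counts finite. The values are made up.

```r
# Hypothetical right-skewed inquiry counts; the real column is TotalInquiries.
inquiries <- c(0, 1, 1, 2, 3, 3, 5, 6, 9, 16, 40, 120)

# log10(x + 1) keeps zero counts finite, since log10(0) is -Inf.
log_inquiries <- log10(inquiries + 1)

# Skew check: a mean far above the median signals a long right tail.
c(mean = mean(inquiries), median = median(inquiries))
c(mean = mean(log_inquiries), median = median(log_inquiries))

# Bin the transformed values; drop plot = FALSE to draw the histogram.
h <- hist(log_inquiries, plot = FALSE)
print(h$counts)
```

After the transform the mean and median sit much closer together, which is what makes the histogram easier to read.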
As shown by the plot, the incomes of the vast majority of the borrowers are between $25,000 and $75,000.
As shown by the graph, the vast majority of loan amounts are between 0 and 10000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
Most loans have fewer than 100 investors, with a peak just above 0, most likely below 10.
The dataset contains 113937 observations (loans) with 81 variables. The typical borrower is from California, lists ‘Professional’ as occupation (second only to ‘Other’), has an annual percentage rate of approx. 0.365, has an income between $25,000 and $75,000, has made fewer than 20 inquiries in total, and has a mid-range credit rating.
The main features of interest in my dataset are, in my opinion, the information about the borrowers, such as state of residence, occupation, interest rate, and income, and its relationship with the loans themselves, such as current delinquent months and the number of investors.
Features like borrower income, borrower state, and borrower credit scores will help my investigation.
No.
Yes. The TotalInquiries distribution is right-skewed, so I log-transformed it to make the plot easier to read.
Loans with a term length of 36 months tend to have more investors and larger variation, possibly because 3-year loans in general have a better chance of generating stable returns.
As seen on the graph, apart from the ‘Other’ bar, the occupation with the largest estimated return is ‘Professional’, for example people running business startups. Our profession, computer programmer, has an average estimated return.
Surprisingly, Iowa has the greatest average number of investors, which raises the question of why. Meanwhile, according to the second graph, states labelled with a bluish color were initially the majority, but they then became the minority as states labelled with a greenish color took over.
Surprisingly, there are returns below zero. Even more surprisingly, loans with a status of ‘Completed’ have the largest fraction of below-zero returns. I wonder what factors contribute to whether a loan gets resolved.
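The fraction of below-zero returns per loan status can be computed directly. A minimal sketch on invented data, using the dataset's column names `LoanStatus` and `EstimatedReturn`:

```r
# Toy data; the values are made up for illustration.
loans <- data.frame(
  LoanStatus = c("Completed", "Completed", "Current", "Current",
                 "Defaulted", "Completed", "Current", "Defaulted"),
  EstimatedReturn = c(-0.02, 0.08, 0.09, NA, 0.05, -0.01, 0.11, 0.07)
)

# Drop rows where the return is missing before computing shares.
known <- loans[!is.na(loans$EstimatedReturn), ]

# For each status, the share of loans whose estimated return is below zero.
# The mean of a logical vector is the proportion of TRUE values.
neg_share <- tapply(known$EstimatedReturn < 0, known$LoanStatus, mean)
print(neg_share)
```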
States like California, New York, and Arizona have the highest TotalInquiries, which suggests an active financial market in those states.
As would be expected given previous results, California has the greatest spread in estimated return, but surprisingly not the greatest average return. This time, the trophy goes to states like Alaska, Tennessee, and South Dakota, which is quite a result.
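Spread and average return per state can be put side by side with the number of loans behind each figure. The sketch below uses invented values and measures spread as the range, which is an assumption, since the text does not say how spread was read off the plot.

```r
# Toy data; the values are made up for illustration.
loans <- data.frame(
  BorrowerState = c("CA", "CA", "CA", "AK", "AK", "TN", "TN"),
  EstimatedReturn = c(0.02, 0.10, 0.18, 0.12, 0.14, 0.11, NA)
)

# Keep only rows with a known estimated return.
known <- loans[!is.na(loans$EstimatedReturn), ]

# Per-state mean, spread (max minus min), and number of loans.
mean_return <- tapply(known$EstimatedReturn, known$BorrowerState, mean)
spread <- tapply(known$EstimatedReturn, known$BorrowerState,
                 function(x) diff(range(x)))
n_loans <- table(known$BorrowerState)

print(cbind(mean_return, spread, n = as.vector(n_loans)))
```

Keeping the loan count next to the mean is useful because an average taken over very few loans can look extreme by chance.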
California has the greatest spread in estimated return, while states like Alaska, Tennessee, and South Dakota have the largest average returns.
States like California, New York, and Arizona have the highest TotalInquiries, which suggests an active financial market in those states.
Loans with a status of ‘Completed’ have the largest fraction of below-zero returns.
Iowa has the greatest average number of investors.
No.
I didn’t conduct any correlational investigations, but it seems that estimated returns and loan status have a very strong relationship.
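For two numeric columns, a correlation check takes one line. The sketch below pairs `BorrowerAPR` with `EstimatedReturn` purely as an example; the pairing and the values are assumptions, not results from the dataset. Loan status is categorical, so it would need a per-group comparison instead.

```r
# Toy data; the values are made up for illustration.
loans <- data.frame(
  BorrowerAPR     = c(0.12, 0.17, 0.25, 0.28, 0.36, 0.21),
  EstimatedReturn = c(0.05, 0.07, 0.09, NA,   0.12, 0.08)
)

# Pearson correlation, using only rows where both values are present.
r <- cor(loans$BorrowerAPR, loans$EstimatedReturn, use = "complete.obs")
print(r)
```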
## Warning: Removed 29084 rows containing missing values (geom_path).
## Warning: Removed 9 rows containing missing values (geom_path).
As shown in the graph, loan statuses drawn in greenish colors tend to attract more investors and hence show more spread in that variable, as would be expected from combining the variables from the last section.
California has the greatest number of TotalInquiries by borrowers, and the largest percentage of loans with a greenish-colored status.
Loans with a term length of 36 months have the largest number of investors and the largest proportion of high rates of return.
For relationships, see the above. There weren’t any features that strengthened each other.
Nothing that wasn’t expected given the previous explorations.
## Warning: Removed 9 rows containing missing values (geom_path).
The plot above shows the relationship between borrower state, Loan Status, and Total Inquiries. The interesting point is that California has both the greatest number of Total Inquiries and the greatest percentage of completed loans, which illustrates how active California is as a lending market.
## Warning: Removed 84 rows containing missing values (geom_path).
The plot above shows the relationship between borrower occupation and the lender’s estimated return, which indicates the market potential of each occupation. Computer programmer scored only somewhere in the middle, while ‘Professional’ scored the highest.
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: position_stack requires non-overlapping x intervals
The plot illustrates, from the borrowers’ side, whether the loans are worth it by showing their APR. The vast majority of borrowers have an APR between 0.1 and 0.3, with a separate spike at around 0.365. Thus, on average, the loans have a reasonable APR.
The biggest struggle while exploring the data was interpreting it. I didn’t understand what most columns meant (except for some pretty obvious ones). I searched the Internet for those specific terms; some searches turned up answers and some didn’t. Then I thought: Prosper is a company, so it might have an online API! I turned to the official Prosper API instead, and it answered the vast majority of the questions I had. The one thing that went well was the coding part, because I refreshed (read: promptly learned) R syntax and the package documentation. The most surprising finding, as I am an incoming junior-year CS major at the University of Toronto, was that computer programmer borrowers only had a middling expected return for lenders. That was a huge blow… (laugh) The one suggestion I can offer about the dataset is that the parameters should be simplified, or at least made less technical. It took me several days to research what those vaguely named parameters mean.